Phylogeographic analysis reveals an ancient East African origin of human herpes simplex virus 2 dispersal out-of-Africa

Human herpes simplex virus 2 (HSV-2) is a ubiquitous, slowly evolving DNA virus. HSV-2 has two primary lineages, one found in West and Central Africa and the other found worldwide. Competing hypotheses have been proposed to explain how HSV-2 migrated out-of-Africa (i)HSV-2 followed human migration out-of-Africa 50-100 thousand years ago, or (ii)HSV-2 migrated via the trans-Atlantic slave trade 150-500 years ago. Limited geographic sampling and lack of molecular clock signal has precluded robust comparison. Here, we analyze newly sequenced HSV-2 genomes from Africa to resolve geography and timing of divergence events within HSV-2. Phylogeographic analysis consistently places the ancestor of worldwide dispersal in East Africa, though molecular clock is too slow to be detected using available data. Rates 4.2 × 10−8−5.6 × 10−8 substitutions/site/year, consistent with previous age estimates, suggest a worldwide dispersal 22-29 thousand years ago. Thus, HSV-2 likely migrated with humans from East Africa and dispersed after the Last Glacial Maximum.

There is no detectable relationship between the year of HSV-2 sampling and the distance from the root of the phylogeny 11,15 . The human migration out-of-Africa hypothesis for HSV-2 dispersal arises from molecular clock estimates calibrated based on inference of codivergence events deeper in the simplexvirus tree that suggest a genomewide clock rate on the order of 10 −8 substitutions/site/year 5,7,8,16 . In contrast, the trans-Atlantic slave trade hypothesis is based upon a power-law rate decay of the molecular clock, whereby more recent branches have a faster clock rate than older branches 17 . The absence of a molecular clock calibrated exclusively within HSV-2 leaves these competing hypotheses unresolved.
Importantly, each of these origin hypotheses is associated with different geographic sources within Africa. If HSV-2 accompanied the earliest human migration out-of-Africa, then the most recent common ancestor (MRCA) of the worldwide diversity in lineage B would emanate from East, Central, or Southern Africa, where the first modern human migration began 10 . However, if HSV-2 left Africa by the trans-Atlantic slave trade, then this lineage would originate from West or Central Africa, where most Africans enslaved via the trans-Atlantic slave trade originated. Currently, most available African HSV-2 genomes are from East Africa 8,11,18 , limiting our ability to formally test these hypotheses.
Here, we clarify the location of the emergence of HSV-2 out-of-Africa by sequencing 65 HSV-2 genomes from 18 previously unsampled countries, including a majority of viral samples from West and Central Africa. We further narrow down the timing of the worldwide dispersal through simulation and empirical molecular clock analysis to determine which origin scenarios are compatible with a lack of clock-like signal in HSV-2. We show that HSV-2 lineage B followed human migration from East Africa, but likely postdated the initial out-of-Africa human migration event.

HSV-2 genomes
We sequenced the enriched DNA of 65 new HSV-2 isolates, including 47 with a plausible link to 13 African countries, permitting us to determine sequence across a mean of 77% of the HSV-2 genome (range: 45-91%). These 65 newly sequenced African HSV-2 genomes were combined with previously published HSV-2 genomes resulting in an alignment of 395 taxa, sampled over 47 years from 37 countries (Supplementary Data 1; Fig. 1A). The total alignment length was 105,520 nt and recombinant fragments were masked.

Maximum likelihood phylogenetic inference
We inferred a maximum likelihood phylogeny using IQTree2 19 , which supported the existence of two distinct HSV-2 lineages (Fig. 1B). These lineages are separated by a relatively long internal branch with length of 0.009 substitutions/site. The nodes defining lineage A and lineage B both had strong topological support (aBayes = 1.0; Supplementary Data 2). Both midpoint rooting and outgroup rooting using the chimpanzee herpesvirus (ChHV) supports rooting the HSV-2 tree on this long branch, consistent with previous studies 8,11 . Lineage A comprised sequences from West Africa (n = 9) and Central Africa (n = 4).
Lineage B contains 382 sequences from all sampled geographic regions, including from West and Central Africa. The basal sequences of this lineage are primarily East African. The basal structure of lineage B has strong phylogenetic support (aBayes 0.95-1.00 for the first 8 nodes; Fig. 1). The crown of lineage B, representing the worldwide diversity, has little observable geographic structure (Fig. 1B) due to the oversampling of genomes from North America and limited sampling of other regions 11,20,21 .

Clock rate inference of empirical sequences
In an attempt to estimate the timing of divergence events within HSV-2, we performed molecular clock analysis using TreeTime 22 . However, despite the inclusion of HSV-2 genomes sampled over a 47-year timespan, we inferred a negative clock rate (Fig. 1C) and a coefficient of determination (R 2 ) of the regression between year of sampling and root-to-tip distance of 0.00. These findings indicate a lack of clock-like signal, consistent with previous investigations of the HSV-2 molecular clock using a tip-dating approach 11,15 .
We first explored the possibility that this lack of signal was due either to the phylogenetically recessed branches basal to the worldwide diversity in lineage B or the long branch separating the two lineages. However, after excluding the recessed basal taxa, we still failed to detect clock-like signal. Similarly, we failed to find of clock-like signal including only the taxa associated with worldwide diversity within lineage B (Supplementary Fig. 1) or only lineage A. These sensitivity analyses all produced coefficient of determination for the molecular clock of R 2 < 0.1.

Likelihood of empirical clock rates
To understand the plausible range of evolutionary rates that would be compatible with the lack of clock-like signal in the empirical HSV-2 data, we explored the shape of likelihood surface across a series of fixed substitution rates ranging between 10 −9 and 10 −3 substitutions/ site/year to determine which molecular clock rates applied to the input ML tree have the highest likelihood and what rates are substantially worse. Given that the maximum likelihood rate estimate is negative, we expected that the slowest substitution rates (i.e., closest to a negative rate) should have the best likelihood. We restricted this analysis to rates between 10 −9 and 10 −3 substitutions/site/year, as this range encompasses previously estimated rates for a variety of viruses. As expected, the slowest rates have the best likelihood scores and the fastest rates have the worst (Fig. 2). Clock rates from 10 −9 through 4.2 × 10 −7 substitutions/site/year are within the 10 loglikelihood points of the best likelihood. Substitution rates faster than 4.2 × 10 −7 substitutions/site/year are a substantially worse fit to the observed data.
We note when we performed the same procedure using a virus with strong molecular clock signal, Ebolavirus from 2014 to 2015 West African epidemic 23 , the peak of this log-likelihood curve reasonably approximated the inferred maximum likelihood rate (Supplementary Fig. 2).

Estimating clock rate of simulated data
To explore at what substitution rates temporal signal should theoretically be detectable in our HSV-2 dataset, given empirical sampling dates and phylogenetic structure, we simulated HSV-2 genome sequences across the maximum likelihood tree under a series of substitution rates, ranging between 10 −9 and 10 −3 substitutions/site/year. We then inferred maximum likelihood tree branch lengths from these simulated data and re-inferred the clock rate. A simulated rate that had detectable temporal signal using this approach would be inconsistent with the empirical data.
We were able to reliably infer clock rates from sequences simulated at 1.0 × 10 −6 substitutions/site/year to 1.0 × 10 −4 substitutions/ site/year (Fig. 3A), indicating the true underlying HSV-2 substitution rate must be slower than these rates. We note that rates faster than 10 −4 substitutions/site/year were underestimated as the age of the root approached the earliest sample date. Based on a breakpoint analysis, for simulated rates slower than 1.0 × 10 −6 substitutions/site/year, the mean of the absolute value of inferred clock rates did not capture the simulated rate, unlike faster rates (segmented regression; Supplementary Fig. 3), suggesting these slower rates are consistent with the empirical data.
As in the empirical analysis, the estimated clock rates simulated under slower rates are often negative (Fig. 3B). Thus, sequences simulated under a clock rate which results in an estimated negative rate indicate that the true data could have been produced under that rate. The fastest rate at which more than 5% of trials infer a negative rate on the simulated data is 1.8 × 10 -6 substitutions/site/year (Fig. 3B). The fastest rate at which it is equally probable for a negative rate to be inferred as positive rate is 3.2 × 10 −7 substitutions/site/year (binomial test; p < 0.05).
The R 2 of the root-to-tip by sample date regression decreases with slower simulated clock rates (Fig. 3C), because as the clock rate (slope) approaches zero less of the divergence is explained by sampling date. More than half of the R 2 estimates were zero in simulations with a rate of 1.8 × 10 -6 substitutions/site/year or slower (Fig. 3C). The simulations of slower rates are consistent with the root-to-tip regression of the empirical HSV-2 dataset, which had an R 2 of zero.
Timing the ancestor of HSV-2 The range of substitution rates we explored is broad, accounting for scenarios in which the time of most recent common ancestor (tMRCA), or root age, for HSV-2 is older than modern humans, 6 million years ago at 10 −9 substitutions/site/year, to the similarly unrealistic 49 years ago at 10 −3 substitutions/site/year (Table 1, Supplementary Data 3). The youngest tMRCA not rejected by any of our clock-rate simulation was 19 kya, assuming 3.2 × 10 −7 substitutions/site/year (Fig. 4B). Using the slowest substitution rate that could be detected in our simulations, 2.4 × 10 −6 substitutions/site/year, we infer root age of 2.6 kya (556 BCE). Therefore, the HSV-2 tMRCA must be older than this date in order to be consistent with the lack of clock-like signal in the empirical data ( Supplementary Fig. 4). HSV-2 tMRCAs between 109 and 146 kya, which are consistent with a previously characterized East-West split in anatomically modern humans in Africa around 110-160 kya 10 , were obtained assuming a substitution rate between 4.2 × 10 −8 and 5.6 × 10 −8 substitutions/site/year, respectively (Fig. 4A). These tMRCAs are also consistent with previously published HSV-2 root age 90-120 kya, based on externally calibrated molecular clocks 7,8 .

Phylogeographic analysis indicates East African origin of worldwide lineage
We performed phylogeographic reconstruction in TreeTime on all empirical time trees inferred with fixed clock rates between 10 −9 and 10 −3 substitutions/site/year to determine the ancestral location of HSV-2 and the source of out-of-Africa migration. Using the substitution rates with HSV-2 tMRCA that is consistent with human divergence and published HSV-2 ages (4.2 × 10 −8 and 5.6 × 10 −8 substitutions/site/year), we estimate that the ancestor of lineage A was in West Africa and that the out-of-Africa dispersal event within lineage B occurred in East Africa between 29.3 to 21.9 kya ( Fig. 4 The phylogeographic placement of East Africa as the source of the HSV-2 out-of-Africa migration is robust across substitution rates (Table 1). Support for East Africa as the ancestral location of lineage B increases with slower substitution rates, exceeding 0.95 for all evolutionary rates that were not rejected by the molecular clock analyses (i.e., <2.4 × 10 −6 substitutions/site/year). Even the youngest possible estimate for the timing of HSV-2 out-of-Africa migration that is consistent the lack of clock-like signal, is 3.9 kya (1883 BCE) and supports East Africa as the source of the out-of-Africa migration ( Table 1). The out-of-Africa age estimate that is consistent with our most permissive rejection strategy was 702 years ago (1316 CE), still predating the trans-Atlantic slave trade by 200 years. Nonetheless, even at this younger age, the out-of-Africa dispersal was still inferred to be in East Africa. Furthermore, the inference of East African origin for the worldwide diversity is robust to geographic partitioning schemes of African and non-African regions (Supplementary Data 4).

Discussion
Understanding the geographic spread of human pathogens, both ancient and recent, provides insight into what conditions lead to viral establishment or extinction, informing the potential dynamics of future epidemics [24][25][26] . To understand the geography and timing of HSV-2 dispersal, we expanded the geographic breadth of HSV-2 genomic sampling and established the sensitivity of tip-dated molecular clock inference. Clock rates between 4.2 × 10 −8 -5.6 × 10 −8 substitutions/site/ year, are consistent with our molecular clock analysis of empirical data and, under these rates we infer an HSV-2 tMRCA of 110-146 kya which is consistent with the previously published timing between 90 and 120 kya 7,8 . Based on these rates, we conclude that HSV-2 most likely accompanied humans migrating from East Africa and dispersed worldwide between 22 and 29 kya (Fig. 5). The tMRCA of modern worldwide diversity, between 22 and 29 kya, is consistent with an HSV-2 strain dispersing after the population bottleneck associated with the Last Glacial Maximum 27 . This strain could have been the first HSV-2 to disperse worldwide, or it could have replaced HSV-2 that accompanied human migrations and later went extinct (as has been reported in HSV-1) 28 .
HSV-2 has ancient zoonotic origins, whereby proto-humans in Africa were infected by an ape simplexvirus [5][6][7] . While the tMRCA of HSV-2, the split of lineage A and B, may represent the age of HSV-2 in humans this divergence may have occurred sometime after establishing in humans. The divergence within HSV-2 may reflect human migration within Africa, however, as HSV-2 likely infected humans the tMRCA almost certainly postdates the origin of modern humans. Therefore it is unlikely that the true substitution rate is slower than 2.4 × 10 -8 substitutions/site/year, which produce a of 250 kya which is older than evidence of anatomically modern humans 10 . Thus HSV-2 migrating with modern humans during the original out-of-Africa migration between 50 and 100 kya 10 , is not consistent with the age of HSV-2 worldwide diversity.
We conclude that a scenario in which HSV-2 left Africa via the trans-Atlantic slave trade between 1500 and 1850 CE is inconsistent with both molecular clock and phylogeographic inference. Out-of-Africa migrations events more recently than ca. 1300 CE would necessitate a viral substitution rate that would produce some clocklike signal using tip-date calibrated molecular clock (i.e., faster than 1.8 × 10 −6 substitutions/site/year). Our clock-signal breakpoint analysis ( Supplementary Fig. 3) indicates that an out-of-Africa event postdating 1083 CE (i.e., substitution rate 1.3 × 10 −6 substitutions/site/year) would necessitate a molecular clock that could be inferred from the dataset. Further, any internally consistent molecular clock rate places the origin of this out-of-Africa migration in East Africa, on the opposite side of the continent from where the majority of Africans enslaved as part of the trans-Atlantic slave trade originated. We acknowledge that it is possible that HSV-2 was subsequently imported into the Americas via the trans-Atlantic slave trade long after the worldwide dispersal, given that there is evidence of other viral pathogens being transported to the Americas from Africa during this time 13 . Nonetheless, we find that both the timing and geographical source of the HSV-2 worldwide dispersal are incompatible with the slave trade hypothesis. A possible explanation for the previous inference of a young age of the HSV-2 out-of-Africa dispersal 11 is the misapplication of substitution rate correction for purifying selection. As viruses evolve for long periods of time, purifying selection can lead to underestimation of branch lengths due to substitution saturation 29,30 . In RNA viruses, this effect is most prominent on internal branches >0.1 substitutions/ site 5,30 , which is an order of magnitude longer than the total tree height of HSV-2 (0.0083 substitutions/site). Bayesian approaches have been developed to explore deeper divergence events in virus phylogenies 5,31 . Additionally, rescaling branch lengths by a power law decay rate can be used to correct for the underestimation 17 . We note that a refinement of the formulation for the power law decay rate model 32 now describes rate variation through time with a sigmoid curve, acknowledging a more biologically plausible constant evolutionary rate near the tips of viral evolutionary trees with rate decay occurring only deeper in the tree. In other words, although rate decay is almost certainly relevant to deeper divergence events within the simplexvirus phylogeny that show evidence of saturation 5 , HSV-2 is young enough that traditional maximum likelihood phylogenetic inference would not underestimate branch lengths, and applying a power law decay rate would inappropriately underestimate the evolutionary time along recent branches.
Previous analysis of HSV-2 geographic structure has primarily partitioned samples into continent-based regions 11,20,21 . Using our novel genome sequences, we were able to increase the resolution of geographic analysis, and show that East Africa is the geographic source of HSV-2 worldwide diversity. However, we note the exact location and path the HSV-2 took upon leaving Africa is still unclear because of (i) the lack of samples from the Horn of Africa which is the likely source of human out-of-Africa migration, (ii) the sampling bias for viral genomes from North America, and (iii) human migration in the modern era which obscures ancient signal. Increased sampling across global regions could improve geographic signal of HSV-2 migration.
HSV-2 genome samples on the order of thousands of years should provide sufficient temporal signal to estimate the true clock rate of HSV-2. In our simulations, genomes sampled over a 47-year timespan can accurately calibrate a clock rate of 1.3 × 10 −6 substitutions/site/year and faster. Hence HSV-2 genomes thousands of years old should have sufficient accumulation of signal to acutely infer a clock rate based on tip dates with a rate of 10 −8 substitutions/site/year. The identification of Mean of absolute value of inferred rates Absolute value of negative rate inferred Positive rate inferred Fig. 3 | Performance of clock rate inference on simulated data. A Inference of clock rate for a fixed input tree with simulated sequences and empirical dates. Black dot represents estimates of positive rates, a blue x represents the absolute value of estimates of negative rates, and red dots indicate the mean of absolute value for each fixed rate. Gray line indicates parity of simulated and estimated rate. B Negative clock rate analysis of the fraction of total simulations for a fixed rate that have a rate estimated as negative. Gray shading represents range for which a negative estimate is equally probable as positive (two-sided binomial test p < 0.05). C R 2 coefficient of determination of root-to-tip regression, a transparent dot represents one replicate. 200 replicates of simulated sequences were generated under GTR + F + Γ 4 model then clock rate estimated on simulated sequences; code to produce the simulation results is at https://github.com/jennifer-bio/HSV2. ancient herpes virus sequences (including HSV-1) in mummified archeological specimens 28,33 suggests the possibility of reconstructing ancient HSV-2 genomes that could be used to inform a tip-dated calibration for this virus and to reveal the ancient paths of HSV-2 transmission.

HSV-2 data set and sequence alignment
Clinical isolates were recovered from patients diagnosed with genital herpes at La Pitié Salpêtrière-Charles Foix University Hospital (Paris, France) who self-reported overseas origin. Due to the observational and retrospective nature of the study, the need for specific informed consent from individual patients was not required according to French law. All data generated in the present study had no impact on the routine standard clinical management of the patients. An informed consent to the treatment of personal data was acquired upon hospital admission from the patients or their legal representatives. HSV-2 isolates were used because of shortages of primary clinical sample material. Isolates were obtained by propagation in subconfluent Vero cell monolayers 34 ). A limited number of passages was necessary for the generation of viral stocks.
We sequenced 65 HSV-2 samples collected 2012-2018 primarily from West and Central Africa. In brief, we extracted DNA from supernatant using the QIAmp Viral RNA Mini Kit (Qiagen, Hilden, Germany) and measured DNA content with a Qubit (Thermo Fischer Scientific, Waltham, MA, USA). HSV-2 copy numbers were determined using a quantitative PCR (qPCR) assay 8 . We then prepared dual indexed libraries using 1 ng DNA extract using the Nextera XT Library Preparation Kit and the Nextera XT Index Kit (Illumina, San Diego, CA, USA), which we pooled so individual libraries would all bring in about the same number of HSV-2 genome copies (according to the qPCR results). The pool was subjected to a double-round hybridization capture with a MYbaits Custom Target Enrichment Kit designed to cover the whole genomes of HSV-1 (NC_001806), HSV-2 (NC_001798), and varicella-zoster virus (VZV, NC_001348; Human herpesvirus 3) (Daicel Arbor Biosciences, Ann Arbor, MI, USA). For both rounds, we followed manufacturer's instructions with the exception that only onefourth of the recommended bait quantity was used. We sequenced the capture product on MiSeq and NextSeq 550 platforms (Illumina) using the MiSeq Reagent Kit v3 (600 cycles; Illumina) and NextSeq Mid Output Kit v2.5 (300 cycles; Illumina), generating a total of 61,926,628 reads.
Low quality bases and adapter sequences were removed from raw reads using Trimmomatic v0.36 35 and overlapping reads were merged using ClipAndMerge v1.7.8 36 . We then mapped trimmed reads to the HSV-2 RefSeq genome (NC_001798) from which we had removed the TRL and TRS regions 37 . We finally sorted mapping files and removed duplicate reads using the SortSam and MarkDuplicates tools from Picard v2.7.1 (http://broadinstitute.github.io/picard/). We generated consensus sequences from the maps using Geneious Prime v2021.2.2 38 , calling bases at positions covered by at least 20 unique reads and for which more than 95% of the reads were in agreement.
Additionally, we downloaded all available HSV-2 sequences of length greater than 100kbp from GenBank in November 2019. Potentially engineered (n = 5) and dubious sequences (n = 10) were removed. HSV-2 sequences along with, ChHV (NC_023677) and HSV-1 (NC_001806) were aligned with MAFFT v7.307 39 . Regions with more than 10% of sequences containing gaps were removed with Geneious. Invariant sites were stripped from the alignment to produce a dataset that was computationally tractable for recombination analysis, resulting in an alignment of 25,809 variant sites of ChHV, HSV-1, and HSV-2 sequences. Recombinant fragments were inferred with RDP4 40 , using five recombination detection methods (RDP, GENECONV, MaxChi, BootScan, and SiScan) and validating recombination events identified by at least two methods. In sequences found to contain recombination, the recombinant fragments within these sequences were masked (with Ns) in the complete alignment (including variant and invariant sites) for subsequent phylogenetic analysis.  A maximum likelihood phylogeny for the HSV-2 dataset was made using IQtree2 v2.0.4 19 , with a GTR + F + R4 substitution model selected by model testing using ModelTesting with cAIC on the initial alignment 41 . To test uncertainty in the topology we used aBayes to estimate branch support 42 . We used TreeTime v0.7.6 22 to perform molecular clock inference on HSV-2 using tip date calibrations. Default TreeTime parameters were used, except for fixing the root, using input branch lengths, and clock filter was set to 0 (no filtering). The root was fixed, as this rooting is supported by including ChHV as an outgroup and is consistent with previous studies 8 . Input topology and branch lengths were used to because joint optimization is not stable without this approach 43 . With the lack of clock like signal it is was not appropriate to filter based on clock variation.

Likelihood of fixed clock rates with empirical data
The log likelihood was calculated for time trees with a range of fixed substitution rates using TreeTime. We used a fixed input tree, empirical sample dates, and no clock filter. The rates used were 10 −9 substitutions/site/year to 10 −3 substitutions/site/year, with a resolution of 8 rates distributed evenly across the log space across each order of magnitude. This distribution exceeds the range of previous clock rate estimates for simplexvirus and HSV-1 11,16 .
To validate the interpretation of the likelihood surface, we performed a similar process with a previously published dataset of 1,610 ebolavirus sequences from the 2014 to 2015 West African Ebola epidemic 23 , which has detectable temporal signal using tip-date calibrations. The Ebola dataset was analyzed with rates from 10 −1 substitutions/site/year to 10 −5 substitutions/site/year, with a resolution of three rates for each order of magnitude spread evenly across log space ( Supplementary Fig. 2).

Clock rate estimates of simulated data
In order to understand what clock rate would result in sufficient temporal signal to detect the molecular clock on the empirical HSV-2 dataset (i.e., range of sampling dates, sequence length, and tree topology), we simulated sequence data across a range of substitution rates, 10 −9 to 10 −3 substitutions/site/year with eight rates across each order of magnitude. Simulation was done with the nucleotide substitution model parameters inferred by IQtree2, under a GTR + F + Γ4 model on the HSV-2 dataset. We first inferred a time tree with a fixed clock rate based on our topology using TreeTime and then scaled this tree to substitutions/site with the clock rate. Sequences were simulated on the rescaled tree with Seq-Gen v1.3.4 44 . We performed 200 replicates of sequence evolution and subsequent analysis.
Branch lengths were estimated on the fixed empirical topology using the simulated sequences with IQtree2 using GTR + F + Γ 4 model. Although the empirical tree inference was performed with free rate variation, we used gamma rate variation in the simulations and subsequent tree inference to maintain consistency between simulation, tree, and clock inference. The clock rate and R 2 of the root-tipregression of the simulated data were then estimated with TreeTime, using previous parameters.
The detection of temporal signal was assessed through different methods using R 45 . The rate at which there was a change in the relationship between the mean absolute value of estimated rate and the simulated rate was inferred by a linear regression break point analysis (R segmented v. 1.2-0).
The point at which the estimated rate reflected the empirical data, in that the estimated clock rate was negative, was determined by two cutoffs. The first cutoff method is the fastest rate at which more than 5% of estimated rates on simulated data were negative. The second cutoff is the fastest rate at which a negative estimate is as likely as a positive estimate, determined by binomial test (p < 0.05).

Phylogeographic ancestral state reconstruction
We performed a migration analysis, which is an inference of transition between discrete geographic regions along branches of the phylogenetic tree, to reconstruct the ancestral geographic state. We used TreeTime migration with default parameters, empirical time trees built with different fixed clock rates and the region of sampling for each tip. Samples with country histories were partitioned into regions, based on the United Nations M49 Standard, into North Africa, West Africa, East Africa, Central Africa, Southern Africa, and countries from all other regions were grouped into non-African region. Six samples had no country of origin, but are known to be sampled from Africa, and were partitioned into a broad Africa, region unknown for phylogeography analysis. Additionally, four samples were from an unknown location and were indicated as missing in the migration analysis.
Robustness analysis of the phylogeographic reconstruction was done by repeating previous migration analysis with a merged West, Central, and Southern Africa region. Additional robustness analysis split the non-African region into worldwide subregions, and reconstruction was done with North Africa, West Africa, East Africa, Central Africa, Southern Africa, Africa region unknown, Eastern Europe, Southern Europe, Western Europe, Northern America, South America, Southeastern Asia, Western Asia, and missing.
The points of particular interest in the trees are first the age of the root, which represents the age of HSV-2 diversification, and second the age of the MRCA of the worldwide diversity within the worldwide lineage, which represents the maximum age for the dispersal of HSV-2 out-of-Africa, assuming an African origin for the lineage B. On the fixed topology, ancestral geographic reconstruction along the basal backbone of the worldwide lineage had majority of support for the African region that is the source of HSV-2 worldwide diversity which decreases until the majority of support is for the non-African region (Supplementary Figure 5). We defined the-out of-Africa age as the mean age of the nodes that define the backbone at the base of lineage B, weighted by a factor of one minus the absolute value of the difference of the support for source African region and support for non-African region. This factor accounts for the uncertainty which point on the maximum likelihood phylogeny that HSV-2 migrated out-of-Africa.
Age is reported in number of years ago, relative to 2018, the most recent sample date.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All sequence data generated in this study are made available in Bio-Project under accession code PRJEB48656. All accession numbers for data used in this manuscript can be found in Supplementary Data 1. Code for running simulation is available at https://doi.org/10.5281/ zenodo.7054908.

Code availability
The latest code for running simulation is available at https://github. com/jennifer-bio/HSV2, code used to run simulation for this manuscript can be referenced at https://doi.org/10.5281/zenodo.7054908.